Multiple previous studies have shown that members of the family of nuclear factor kB (NF-kB) transcription factors can play very important roles in mulitple cellular processes (immune response, cell proliferation, cell survival, etc.). Activation of canonical NF-kB signaling pathways are fairly well studied, but there is a lack of information regarding the activation of noncanonical NF-kB signaling pathways. Using high-content phenotypic screening, Henry et al. (2018) were able to identify cyclin-dependent kinase 12 (CDK12) as a regulator of this pathway. Further, chemoproteomics identified the chemical compound 919278 which targets CDK12, specifically functioning as an inhibitor of CDK12. To better understand global trancsriptional changes, the authors sequenced the transcriptome of a human osteosarcoma cell line (U2OS) and treated the cells in a range of different experimental conditions (stimulation (TWEAK or unstimulated), time of treatment/stimulation (24hr or 4hr), and treatments (BIO0919278, BIO0702697, BIO032202, DMSO, NTsiRNA, siRNAs523626, siRNAs523629)).
It is important to note that BIO0702697 is the S-enantiomer of BIO0919278. Using high-throughput screening techniques, the authors found that BIO0702697 was significantly less potent of a compound relative to BIO0919278 relative to activating the noncanonical NF-kB signaling pathway. The main goal of the differential expression analysis in this data analysis is to analyze differential gene expression between the two enantiomer treatments (BIO0702697 and BIO0919278). This type of analysis would help us determine the extent to which stereochemistry of these molecules affects pathways related to noncanonical NF-kB signaling. It will ideally help us validate and provide further evidence for the results in the original paper.
The resulting raw RNA-seq data have been deposited at the Gene Expression Omnibus (GEO), where they are publicly available under accession GSE113926. Here, we show a first exploratory analysis of the corresponding RNA-seq gene expression profiles generated as a table of counts using the DEE2 (http://dee2.io) pipeline by Ziemann, Kaspi, and El-Osta (2019), and further packaged into a SummarizedExperiment object with genes mapped to Entrez identifiers. This object also stores the phenotypic information about the profiled samples that has been also made available at GEO.
We start importing the raw table of counts.
library(SummarizedExperiment)
se <- readRDS(usethis::proj_path(file.path("data", "data.rds")))
se
class: RangedSummarizedExperiment
dim: 25122 44
metadata(4): experimentData annotation ensemblVersion urlProcessedData
assays(1): counts
rownames(25122): 1 10 ... 9994 9997
rowData names(5): gene_id gene_biotype description gene_id_version
symbol
colnames(44): SRR7089369 SRR7089370 ... SRR7089423 SRR7089424
colData names(44): title geo_accession ... time:ch1 treatment:ch1
We have 25122 genes by 44 samples. From the first row and column names shown by the object, we can figure out that genes are defined by Entrez (Maglott et al. 2010) identifiers and samples by Sequence Read Archive Run ([SRR] (https://www.ncbi.nlm.nih.gov/books/NBK56913/#search.what_do_the_different_sra_accessi)) identifiers.
The row data in this object contains information about the profiled genes.
head(rowData(se))
DataFrame with 6 rows and 5 columns
gene_id gene_biotype
<character> <character>
1 ENSG00000121410 protein_coding
10 ENSG00000156006 protein_coding
100 ENSG00000196839 protein_coding
1000 ENSG00000170558 protein_coding
10000 ENSG00000117020 protein_coding
100008586 ENSG00000236362 protein_coding
description
<character>
1 alpha-1-B glycoprotein [Source:HGNC Symbol;Acc:HGNC:5]
10 N-acetyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:7646]
100 adenosine deaminase [Source:HGNC Symbol;Acc:HGNC:186]
1000 cadherin 2 [Source:HGNC Symbol;Acc:HGNC:1759]
10000 AKT serine/threonine kinase 3 [Source:HGNC Symbol;Acc:HGNC:393]
100008586 G antigen 12F [Source:HGNC Symbol;Acc:HGNC:31906]
gene_id_version symbol
<character> <character>
1 ENSG00000121410.11 A1BG
10 ENSG00000156006.4 NAT2
100 ENSG00000196839.12 ADA
1000 ENSG00000170558.8 CDH2
10000 ENSG00000117020.16 AKT3
100008586 ENSG00000236362.8 GAGE12F
Among this information, the gene symbol and description are potentially useful for interpreting results of, for instance, a differential expression analysis. Let’s explore now the column (phenotypic) data.
dim(colData(se))
[1] 44 44
head(colData(se), n=3)
DataFrame with 3 rows and 44 columns
title geo_accession status
<factor> <character> <factor>
SRR7089369 UnStim-DMSO-4hr-A GSM3123750 Public on Jul 31 2018
SRR7089370 TWEAK-DMSO-4hr-A GSM3123751 Public on Jul 31 2018
SRR7089373 UnStim-BIO0919278-4hr-A GSM3123754 Public on Jul 31 2018
submission_date last_update_date type channel_count
<factor> <factor> <factor> <factor>
SRR7089369 May 01 2018 Jul 31 2018 SRA 1
SRR7089370 May 01 2018 Jul 31 2018 SRA 1
SRR7089373 May 01 2018 Jul 31 2018 SRA 1
source_name_ch1 organism_ch1 characteristics_ch1
<factor> <factor> <factor>
SRR7089369 U2OS cell Homo sapiens stimulation: UnStim
SRR7089370 U2OS cell Homo sapiens stimulation: TWEAK
SRR7089373 U2OS cell Homo sapiens stimulation: UnStim
characteristics_ch1.1 characteristics_ch1.2 characteristics_ch1.3
<factor> <factor> <factor>
SRR7089369 treatment: DMSO time: 4hr cell line: U2OS
SRR7089370 treatment: DMSO time: 4hr cell line: U2OS
SRR7089373 treatment: BIO0919278 time: 4hr cell line: U2OS
molecule_ch1
<factor>
SRR7089369 polyA RNA
SRR7089370 polyA RNA
SRR7089373 polyA RNA
extract_protocol_ch1
<factor>
SRR7089369 RNA libraries were prepared for sequencing using standard Illumina protocols
SRR7089370 RNA libraries were prepared for sequencing using standard Illumina protocols
SRR7089373 RNA libraries were prepared for sequencing using standard Illumina protocols
taxid_ch1 description
<factor> <factor>
SRR7089369 9606 replicate A
SRR7089370 9606 replicate A
SRR7089373 9606 replicate A
data_processing
<factor>
SRR7089369 Reads were aligned to reference genome (hg19) using STAR aligner
SRR7089370 Reads were aligned to reference genome (hg19) using STAR aligner
SRR7089373 Reads were aligned to reference genome (hg19) using STAR aligner
data_processing.1
<factor>
SRR7089369 QC includes sequence quality, GC content, 5’-3’ gene body coverage (Supplementary Table S6)
SRR7089370 QC includes sequence quality, GC content, 5’-3’ gene body coverage (Supplementary Table S6)
SRR7089373 QC includes sequence quality, GC content, 5’-3’ gene body coverage (Supplementary Table S6)
data_processing.2
<factor>
SRR7089369 Outlier detection absolute Z score > 2 was applied on overall sequencing quality score, 5 prime coverage, 3 prime coverage, mean_GC content, duplication rate, mean_ and mapped percentage. Sample with absolute Z score > 2 would be discarded, which did not apply to this study.
SRR7089370 Outlier detection absolute Z score > 2 was applied on overall sequencing quality score, 5 prime coverage, 3 prime coverage, mean_GC content, duplication rate, mean_ and mapped percentage. Sample with absolute Z score > 2 would be discarded, which did not apply to this study.
SRR7089373 Outlier detection absolute Z score > 2 was applied on overall sequencing quality score, 5 prime coverage, 3 prime coverage, mean_GC content, duplication rate, mean_ and mapped percentage. Sample with absolute Z score > 2 would be discarded, which did not apply to this study.
data_processing.3
<factor>
SRR7089369 Aligned reads were counted against gene model annotation (Gencode v18) to obtain expression values by using FeatureCounts
SRR7089370 Aligned reads were counted against gene model annotation (Gencode v18) to obtain expression values by using FeatureCounts
SRR7089373 Aligned reads were counted against gene model annotation (Gencode v18) to obtain expression values by using FeatureCounts
data_processing.4 data_processing.5
<factor> <factor>
SRR7089369 DESeq2 was used for gene expression normalization Genome_build: hg19
SRR7089370 DESeq2 was used for gene expression normalization Genome_build: hg19
SRR7089373 DESeq2 was used for gene expression normalization Genome_build: hg19
data_processing.6
<factor>
SRR7089369 Supplementary_files_format_and_content: Tab-delimited text files include RPKM values for each Sample:
SRR7089370 Supplementary_files_format_and_content: Tab-delimited text files include RPKM values for each Sample:
SRR7089373 Supplementary_files_format_and_content: Tab-delimited text files include RPKM values for each Sample:
data_processing.7
<factor>
SRR7089369 DESeq2_normalized_count_matrix.txt: Count matrix for counts per gene, normalized for library size using DESeq2
SRR7089370 DESeq2_normalized_count_matrix.txt: Count matrix for counts per gene, normalized for library size using DESeq2
SRR7089373 DESeq2_normalized_count_matrix.txt: Count matrix for counts per gene, normalized for library size using DESeq2
data_processing.8
<factor>
SRR7089369 DESeq2_regularized_log_transformed.txt: Made using the rlogTransformation function in DESeq2. This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. These are the values used to obtain clustering and PCA results.
SRR7089370 DESeq2_regularized_log_transformed.txt: Made using the rlogTransformation function in DESeq2. This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. These are the values used to obtain clustering and PCA results.
SRR7089373 DESeq2_regularized_log_transformed.txt: Made using the rlogTransformation function in DESeq2. This function transforms the count data to the log2 scale in a way which minimizes differences between samples for rows with small counts, and which normalizes with respect to library size. These are the values used to obtain clustering and PCA results.
data_processing.9
<factor>
SRR7089369 QC_statistics_to_deliver.txt: Tot_reads(M)- Total number of reads in the sample, in millions rRNA(%)- Percent of reads aligned to ribosomal RNA Map(%)- Percent of reads which map to genome UQ_map(%)- Percent of reads which map to only one locus on the genome Gene_assn(%)- Of uniquely mapped reads, percentage which may be assigned to exons of genes by gene counting program featureCounts Strand(%)- Percentage of reads which are reversely stranded. If unstranded protocol, this is not counted and listed as 0. 5prime_cov- Mean normalized coverage from Picard for percentiles 11-30 along gene body 3prime_cov- Mean normalized coverage from Picard for percentiles 71-90 along gene body Mean_GC(%)- Mean GC content averaging GC content for each read across all reads Dup(%)- Percentage of reads which were duplicates Mean_inner_dist- Inner distance = Read 2 start - Read 1 end. Negative indicates overlap. CDS, UTR, Intronic, and Intergenic(%)- Percentage of reads assigned to different regions of the genome. These four add up to 100%. chrM(%)- For reads assigned to genes, percentage assigned to mitochondrial genes. Top_ten(%)- For reads assigned to genes, percentage assigned to the top 10 most highly expressed genes.
SRR7089370 QC_statistics_to_deliver.txt: Tot_reads(M)- Total number of reads in the sample, in millions rRNA(%)- Percent of reads aligned to ribosomal RNA Map(%)- Percent of reads which map to genome UQ_map(%)- Percent of reads which map to only one locus on the genome Gene_assn(%)- Of uniquely mapped reads, percentage which may be assigned to exons of genes by gene counting program featureCounts Strand(%)- Percentage of reads which are reversely stranded. If unstranded protocol, this is not counted and listed as 0. 5prime_cov- Mean normalized coverage from Picard for percentiles 11-30 along gene body 3prime_cov- Mean normalized coverage from Picard for percentiles 71-90 along gene body Mean_GC(%)- Mean GC content averaging GC content for each read across all reads Dup(%)- Percentage of reads which were duplicates Mean_inner_dist- Inner distance = Read 2 start - Read 1 end. Negative indicates overlap. CDS, UTR, Intronic, and Intergenic(%)- Percentage of reads assigned to different regions of the genome. These four add up to 100%. chrM(%)- For reads assigned to genes, percentage assigned to mitochondrial genes. Top_ten(%)- For reads assigned to genes, percentage assigned to the top 10 most highly expressed genes.
SRR7089373 QC_statistics_to_deliver.txt: Tot_reads(M)- Total number of reads in the sample, in millions rRNA(%)- Percent of reads aligned to ribosomal RNA Map(%)- Percent of reads which map to genome UQ_map(%)- Percent of reads which map to only one locus on the genome Gene_assn(%)- Of uniquely mapped reads, percentage which may be assigned to exons of genes by gene counting program featureCounts Strand(%)- Percentage of reads which are reversely stranded. If unstranded protocol, this is not counted and listed as 0. 5prime_cov- Mean normalized coverage from Picard for percentiles 11-30 along gene body 3prime_cov- Mean normalized coverage from Picard for percentiles 71-90 along gene body Mean_GC(%)- Mean GC content averaging GC content for each read across all reads Dup(%)- Percentage of reads which were duplicates Mean_inner_dist- Inner distance = Read 2 start - Read 1 end. Negative indicates overlap. CDS, UTR, Intronic, and Intergenic(%)- Percentage of reads assigned to different regions of the genome. These four add up to 100%. chrM(%)- For reads assigned to genes, percentage assigned to mitochondrial genes. Top_ten(%)- For reads assigned to genes, percentage assigned to the top 10 most highly expressed genes.
data_processing.10
<factor>
SRR7089369 featureCounts_count_matrix.txt: Raw count matrix with counts per gene as obtained by featureCounts
SRR7089370 featureCounts_count_matrix.txt: Raw count matrix with counts per gene as obtained by featureCounts
SRR7089373 featureCounts_count_matrix.txt: Raw count matrix with counts per gene as obtained by featureCounts
data_processing.11
<factor>
SRR7089369 gene_annotation_info.txt: More info on the genes in the count matrices, such as gene coordinates, biotype (coding vs. noncoding), etc.
SRR7089370 gene_annotation_info.txt: More info on the genes in the count matrices, such as gene coordinates, biotype (coding vs. noncoding), etc.
SRR7089373 gene_annotation_info.txt: More info on the genes in the count matrices, such as gene coordinates, biotype (coding vs. noncoding), etc.
data_processing.12
<factor>
SRR7089369 kallisto_TPM_matrix.txt: Per transcript in the annotation, the TPM (transcripts per million) value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
SRR7089370 kallisto_TPM_matrix.txt: Per transcript in the annotation, the TPM (transcripts per million) value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
SRR7089373 kallisto_TPM_matrix.txt: Per transcript in the annotation, the TPM (transcripts per million) value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
data_processing.13
<factor>
SRR7089369 kallisto_est_count_matrix.txt: Per transcript in the annotation, the estimated counts value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
SRR7089370 kallisto_est_count_matrix.txt: Per transcript in the annotation, the estimated counts value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
SRR7089373 kallisto_est_count_matrix.txt: Per transcript in the annotation, the estimated counts value given by Kallisto. If this file is not present, then Kallisto was not run for this project.
platform_id data_row_count instrument_model library_selection
<factor> <factor> <factor> <factor>
SRR7089369 GPL16791 0 Illumina HiSeq 2500 cDNA
SRR7089370 GPL16791 0 Illumina HiSeq 2500 cDNA
SRR7089373 GPL16791 0 Illumina HiSeq 2500 cDNA
library_source library_strategy
<factor> <factor>
SRR7089369 transcriptomic RNA-Seq
SRR7089370 transcriptomic RNA-Seq
SRR7089373 transcriptomic RNA-Seq
relation
<factor>
SRR7089369 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN09007224
SRR7089370 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN09007223
SRR7089373 BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN09007220
relation.1
<factor>
SRR7089369 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX4018354
SRR7089370 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX4018355
SRR7089373 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX4018358
supplementary_file_1 cell line:ch1 stimulation:ch1 time:ch1
<factor> <character> <character> <character>
SRR7089369 NONE U2OS UnStim 4hr
SRR7089370 NONE U2OS TWEAK 4hr
SRR7089373 NONE U2OS UnStim 4hr
treatment:ch1
<character>
SRR7089369 DMSO
SRR7089370 DMSO
SRR7089373 BIO0919278
We have a total of 44 phenotypic variables. The second column geo_accession contains GEO Sample Accession Number
(GSM) identifers. GSM identifiers define individual samples, understood in this context as individual sources of RNA. If the GSM identifiers are repeated, this indicates that among the 44 samples there are technical replicates. We can figure out how many technical replicates per GSM sample we have as follows:
length(unique(se$geo_accession))
[1] 44
table(lengths(split(colnames(se), se$geo_accession)))
1
44
This has shown us that there are no technical replicates among the 44 samples (all GSM identifiers are unique).
To proceed further exploring this dataset, we are going to use the edgeR package and build a DGEList object, incorporating the gene metadata, which includes the gene symbol.
library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=rowData(se))
dim(dge)
[1] 25122 44
We will now calculate the \(\log_2\) CPM units of expression and put them as an additional assay element to ease their manipulation. CPM is a form of within-sample scaling using counts per million reads (CPM) mapped to the genome.
assays(se)$logCPM <- cpm(dge, log=TRUE)
assays(se)$logCPM[1:5, 1:5]
SRR7089369 SRR7089370 SRR7089373 SRR7089374 SRR7089375
1 -1.168200 -0.5456482 -0.2786859 -1.523985 -1.334666
10 -2.234508 -1.7036348 -2.2345084 -2.234508 -2.234508
100 3.904497 4.0985553 4.4986419 4.357345 4.027829
1000 7.875925 7.8886217 7.9733523 8.040212 7.904940
10000 7.413678 7.4171096 7.2886915 7.341795 7.326141
Let’s explore now some of the phenotypic variables. Based on the metadata that is provided, we know what information is stored in each variable. After some visual inspection, we can see that the variables characteristics_ch1, characteristics_ch1.1, characteristics_ch1.2,characteristics_ch1.3 contain the information about stimulation, treatment, time, and cell line, respectively. These variables represent the relevant experimental conditions
table(se$characteristics_ch1) #stimulation
stimulation: TWEAK stimulation: UnStim
23 21
table(se$characteristics_ch1.1) #treatment
treatment: BIO032202 treatment: BIO0702697 treatment: BIO0919278
7 5 7
treatment: DMSO treatment: NTsiRNA treatment: siRNAs523626
5 8 5
treatment: siRNAs523629
7
table(se$characteristics_ch1.2) #time
time: 24h time: 24hr time: 4hr
0 22 22
table(se$characteristics_ch1.3) #cell line
cell line: U2OS
44
To facilitate the handling of these variables we are going to recode them as follows:
se$stimulation <- se$characteristics_ch1
se$treatment <- se$characteristics_ch1.1
se$time <- se$characteristics_ch1.2
se$cell_line <- se$characteristics_ch1.3
We can also identify some variables associated with technical factors, such as the sample preparation protocol in extract_protocol_ch1.
table(se$extract_protocol_ch1)
RNA libraries were prepared for sequencing using standard Illumina protocols
44
We can see that all samples were prepared using the same protocol. Because of there, there is unlikely to be any sort of variation in the data due to technical protocols.
Finally, we also observe that the variable description contains some relevant information about an apparent sub-grouping of the samples.
se$description
V2 V3 V6 V7 V8 V9
replicate A replicate A replicate A replicate A replicate A replicate A
V10 V11 V13 V14 V15 V17
replicate A replicate A replicate A replicate A replicate A replicate A
V20 V21 V23 V24 V25 V26
replicate B replicate B replicate B replicate B replicate B replicate B
V28 V29 V30 V31 V32 V33
replicate B replicate B replicate B replicate B replicate B replicate B
V34 V35 V36 V37 V38 V39
replicate A replicate A replicate A replicate A replicate A replicate A
V40 V41 V42 V44 V46 V47
replicate A replicate A replicate A replicate A replicate B replicate B
V49 V50 V51 V52 V53 V55
replicate B replicate B replicate B replicate B replicate B replicate B
V56 V57
replicate B replicate B
Levels: replicate A replicate B
We can see that there are biological replicates for each of the samples. Later on the analysis, we will determine whether these biological replicates may contribute to confounding batch effects.
In Table 1 below, we show a list of all samples and their corresponding experimental conditions in order to gain as much of an understaning as possible about the underlying experimental design.
| Identifer | Cell line | Time | Stimulation | Treatment | Replicate | |
|---|---|---|---|---|---|---|
| V2 | SRR7089369 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: DMSO | replicate A |
| V3 | SRR7089370 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: DMSO | replicate A |
| V6 | SRR7089373 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: BIO0919278 | replicate A |
| V7 | SRR7089374 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: BIO0919278 | replicate A |
| V8 | SRR7089375 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: BIO032202 | replicate A |
| V9 | SRR7089376 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: BIO032202 | replicate A |
| V10 | SRR7089377 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: DMSO | replicate A |
| V11 | SRR7089378 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: DMSO | replicate A |
| V13 | SRR7089380 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: BIO0702697 | replicate A |
| V14 | SRR7089381 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: BIO0919278 | replicate A |
| V15 | SRR7089382 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: BIO0919278 | replicate A |
| V17 | SRR7089384 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: BIO032202 | replicate A |
| V20 | SRR7089387 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: BIO0702697 | replicate B |
| V21 | SRR7089388 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: BIO0702697 | replicate B |
| V23 | SRR7089390 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: BIO0919278 | replicate B |
| V24 | SRR7089391 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: BIO032202 | replicate B |
| V25 | SRR7089392 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: BIO032202 | replicate B |
| V26 | SRR7089393 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: DMSO | replicate B |
| V28 | SRR7089395 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: BIO0702697 | replicate B |
| V29 | SRR7089396 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: BIO0702697 | replicate B |
| V30 | SRR7089397 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: BIO0919278 | replicate B |
| V31 | SRR7089398 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: BIO0919278 | replicate B |
| V32 | SRR7089399 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: BIO032202 | replicate B |
| V33 | SRR7089400 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: BIO032202 | replicate B |
| V34 | SRR7089401 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: NTsiRNA | replicate A |
| V35 | SRR7089402 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: NTsiRNA | replicate A |
| V36 | SRR7089403 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: siRNAs523626 | replicate A |
| V37 | SRR7089404 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: siRNAs523626 | replicate A |
| V38 | SRR7089405 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: siRNAs523629 | replicate A |
| V39 | SRR7089406 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: siRNAs523629 | replicate A |
| V40 | SRR7089407 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: NTsiRNA | replicate A |
| V41 | SRR7089408 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: NTsiRNA | replicate A |
| V42 | SRR7089409 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: siRNAs523626 | replicate A |
| V44 | SRR7089411 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: siRNAs523629 | replicate A |
| V46 | SRR7089413 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: NTsiRNA | replicate B |
| V47 | SRR7089414 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: NTsiRNA | replicate B |
| V49 | SRR7089416 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: siRNAs523626 | replicate B |
| V50 | SRR7089417 | cell line: U2OS | time: 4hr | stimulation: UnStim | treatment: siRNAs523629 | replicate B |
| V51 | SRR7089418 | cell line: U2OS | time: 4hr | stimulation: TWEAK | treatment: siRNAs523629 | replicate B |
| V52 | SRR7089419 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: NTsiRNA | replicate B |
| V53 | SRR7089420 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: NTsiRNA | replicate B |
| V55 | SRR7089422 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: siRNAs523626 | replicate B |
| V56 | SRR7089423 | cell line: U2OS | time: 24hr | stimulation: UnStim | treatment: siRNAs523629 | replicate B |
| V57 | SRR7089424 | cell line: U2OS | time: 24hr | stimulation: TWEAK | treatment: siRNAs523629 | replicate B |
At this point, the experimental protocol can be easily understood for the data. There are seven different treatment levels (BIO032202, BIO0702697, BIO0919278, DMSO, NTsiRNA, siRNAs523626, siRNAs523629), two different stimulation levels (Unstim, TWEAK), and two different time periods for stimulation and treatment (24hr, 4hr). The cell line is consistent for all samples (U2OS). There is a biological replicate (but no technical replicates) for each sample as well.
We can generate subgrouping varibales for each of the experimental conditions with multiple levels (treatment, stimulation, time, replicate). This is done as follows:
se$samplegrouptreatment <- factor(sapply(strsplit(as.character(se$treatment),
"-"), function(x) x[1]))
table(se$samplegrouptreatment)
treatment: BIO032202 treatment: BIO0702697 treatment: BIO0919278
7 5 7
treatment: DMSO treatment: NTsiRNA treatment: siRNAs523626
5 8 5
treatment: siRNAs523629
7
se$samplegroupdescription <- factor(sapply(strsplit(as.character(se$description),
"-"), function(x) x[1]))
table(se$samplegroupdescription)
replicate A replicate B
22 22
se$samplegrouptime <- factor(sapply(strsplit(as.character(se$time),
"-"), function(x) x[1]))
table(se$samplegrouptime)
time: 24hr time: 4hr
22 22
se$samplegroupstim <- factor(sapply(strsplit(as.character(se$stimulation),
"-"), function(x) x[1]))
table(se$samplegroupstim)
stimulation: TWEAK stimulation: UnStim
23 21
se$samplegroupcell <- factor(sapply(strsplit(as.character(se$cell_line),
"-"), function(x) x[1]))
table(se$samplegroupcell)
cell line: U2OS
44
Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order with respect to replicate groups. Figure 2 is with respect to treatment group, Figure 3 with respect to time, and Figure 4 with respect to presence of stimulation. This will be done with respect to every subgroup created above in order to gain as much information as possible about the variation in the data.
Figure 1: Library sizes in increasing order (replicate group)
Figure 2: Library sizes in increasing order (treatment)
Figure 3: Library sizes in increasing order (time)
Figure 4: Library sizes in increasing order (stimulation)
We see fairly substantial differences in sequencing depth, ranging from 5.89 to 12.13 million reads. It appears that the experimental condition groups, for the most part, are not very related to sequencing depth. There may be one exception - it seems that the samples treated with BIO0919278 were sequenced at a lower depth compared to the other treatments. All of the BIO0919278 samples were sequenced at a depth lower than 8.5 million reads, and the histogram shows that these samples are heavily concentrated to the left.
Figure 5 below shows the distribution of expression values per sample in logarithmic CPM units of expression.
Figure 5: Non-parametric density distribution of expression profiles per sample
Figure 6 below shows the distribution of expression values per sample in logarithmic CPM units of expression.
Figure 6: Non-parametric boxplot of expression profiles per sample
The multidensity plot and boxplot reveal that there are not substantial differences between the samples in the distribution of expression values.
Let’s calculate now the average expression per gene through all the samples.
avgexp <- rowMeans(assays(se)$logCPM)
Figure 7 shows the distribution of those values across genes.
Figure 7: Distribution of average expression level per gene
As expected, we have two modes, one for genes that are lowly expressed in nearly all samples and another for genes with some detectable levels of expression across a number of samples.
We filter lowly-expressed genes using the function filterByExpr(), grouping by sample-group to define the minimum number of samples in which a gene should be expressed. I will choose to use the sample group organized by replicates (samplegroupdescription) since it will eliminate genes that were consistently lowly expressed.
mask <- filterByExpr(dge, group=se$samplegroupdescription)
se.filt <- se[mask, ]
dim(se.filt)
[1] 13228 44
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 13228 44
We are left with 13228 genes.
We will now calculate the normalization factors on the filtered expression data set.
dge.filt <- calcNormFactors(dge.filt)
The following code replaces the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object by the normalized ones.
assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE,
normalized.lib.sizes=TRUE)
We examine now the MA-plots of the normalized expression profiles in Figure 8.
Figure 8: MA-plots of filtered and normalized expression values
These MA plots plot one sample at a time against the average of the rest of the samples. It should be generally expected that the ratio between M (log ratio) and A (mean) average should be close to 1 since we assume only a small fraction of the genes will be differentially expressed between samples. After viewing the MA plots for each sample, we can see that there may exist intensity dependent bias for many of the samples. First of all, we should keep an eye on samples with these biases in case they also display other unexpected features, because then we might consider removing them. However, we already utilized a between sample normalization technique (TMM). The reason for this was to compensate for any systematic technical effects so that we could attribute the observed differences between the samples to a biological signal. Since the data was already normalized, perhaps the observed differences are due to a biological signal. This is a plausible explanation since this experiment includes so many different experimental conditions (treatment, stimulation, time) and it is likely that these conditions would induce signficant changes in gene expression levels. Regardless, we should continue to be aware of potential biases.
We have alerady described the experimental design above. All combinations of cell line, treatment, stimulation, and stimulation time have been sequenced. However, not every biological replicate for each unique combination was sequenced. For this reason, we can anticipate that it won’t be possible to identify expression changes associated with the replicate sample group. Additionally, though the experimental protocol was the same for the replicates, it is possible that there exist batch effects related to the replicates that we will be unable to discern. This means that all differences between replicate A and replicate B samples may not just be due to biological differences but also logistical (time of sequencing, etc.).
We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating sample group and treatment. We calculate again log CPM values with a high prior count(3) to moderate extreme fold-changes produced by low counts. Visualizing the clustering of samples separated by replicate group as well as experimental conditions should help us to estimate potential batch effects. The same procedure (shown below) is followed for each sample group:
logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(se.filt$samplegrouptreatment)
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(se.filt$treatment, colnames(se), sep="\n")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
The resulting dendrograms for each sample group are shown below. Figure 9 is with respect to treatment. Figure 10 is with respect to time. Figure 11 is with respect to stimulation. Figure 12 is with respect to replication.
Figure 9: Hierarchical clustering of the samples (treatment)
Figure 10: Hierarchical clustering of the samples (time)
Figure 11: Hierarchical clustering of the samples (stimulation)
Figure 12: Hierarchical clustering of the samples (replication group)
As expected, the samples for each of the different treatments tend to cluster separately from eachother. Additionally, it seems that samples for the two different stimulation/treatment times also cluster together. Therefore, it is a reasonable possibility that treatment and stimulation/treatment time seem to drive the largest portion of the variablity in the whole dataset. There is not much clustering for samples with respect to replicates or stimulation. However, since we know that the stimulant (TWEAK) activates noncanonical NF-kB signaling, we will continue to take into account the potential effects of stimulation on differential gene expression. Overall, these results may imply that there is not a strong batch effect related to replicates, as was previously hypothesized. In Figure 13 we show the corresponding MDS plot related to treatment and Figure 14 related to stimulation time, since those variables were the ones that showed the most significant clustering.
Figure 13: Multidimensional scaling plot of the samples (treatment)
Figure 14: Multidimensional scaling plot of the samples (time)
The MDS plot for tereatment shows clear differences between BIO0919278 treatments and all other treatments. It also reinforces the fact that there is clear clustering of samples by treatment. The MDS plot for stimulation time is a bit less clear. There is a clear difference for some of the treatments that were stimulated for 24 hrs, however, this is not consistent and there is not a strong clustering pattern with respect to stimulation time. It is a more reasonable explanation that the clustering seen from stimulation time can more accurately be attributed to treatment. For this reason, in this dataset it makes sense to compare different treatment conditions. Specifically, we are most interested in discerning the difference in gene expression between the two enantiomer treatments (BIO0919278 and BIO0702697). The MDS plot with respect to treatment indicates that we should expect differential gene expression between the two enantiomers.
We will first perform a simple assessment of the extent of expression changes and their associated p-values using the F-test implemented in the R/Bioconductor package sva. We compare treatment with BIO0919278 and its enantiomer BIO0702697. We first subset the data as follows:
se.filt.enants <- se.filt[, se.filt$treatment == "treatment: BIO0919278" | se.filt$treatment == "treatment: BIO0702697"]
se.filt.enants$treatment <- droplevels(se.filt.enants$treatment)
length(rownames(colData(se.filt.enants)))
[1] 12
se.filt.enants$treatment
V6 V7 V13
treatment: BIO0919278 treatment: BIO0919278 treatment: BIO0702697
V14 V15 V20
treatment: BIO0919278 treatment: BIO0919278 treatment: BIO0702697
V21 V23 V28
treatment: BIO0702697 treatment: BIO0919278 treatment: BIO0702697
V29 V30 V31
treatment: BIO0702697 treatment: BIO0919278 treatment: BIO0919278
Levels: treatment: BIO0702697 treatment: BIO0919278
In the second step above, we dropped the unused levels from the treatment factor variable. This is important to avoid using samples with treatments that do not exist in this subset of the data. We can see that this new dataset includes 12 samples, and looking at the treatment data we can see that 7 were treated with BIO0919278 and 5 were treated with BIO0702697.We can now build the corresponding full and null model matrices. Based on the results from the MDS plot earlier in the analysis, I have decided not to adjust for known covariates. The plots did not show any clear clustering patterns based on stimulation or time, and so I am opting to build my matricess with the unknown covariate methodology as follows:
mod <- model.matrix(~ se.filt.enants$treatment,
colData(se.filt.enants))
mod0 <- model.matrix(~ 1, colData(se.filt.enants))
Finally, we conduct the F-test implemented in the package sva and examine the amount of differential expression between treatment with BIO0919278 and its enantiomer BIO0702697. I have set the p values as 0.01 and 0.05 in order to reduce the number of DEGs reported since there are quite a high number.
library(sva)
pv <- f.pvalue(assays(se.filt.enants)$logCPM, mod, mod0)
sum(p.adjust(pv, method="fdr") < 0.01)
[1] 4080
sum(p.adjust(pv, method="fdr") < 0.05)
[1] 6386
We obtain 4080 differentially expressed genesn (DEGs) at FDR < 1% and 6386 at FDR < 5%. In Figure 15 below we can see the distribution of the resulting p-values.
Figure 15: Distribution of raw p-values for an F-test on every gene between treatment with BIO0919278 vs BIO0702697
The p-values are quite uniform, though there is some slight enrichment at lower p-values. This may indicate some potential confounding factors that are influencing the results. We build a table with the subset of DEGs with FDR < 5% and show the top-10 genes with lowest p-value in Table ?? below.
This was a very basic preliminary differential expression analysis. The analysis that we have conducted so far is “muddy” in the sense that it has failed to separate the samples by the presence of stimulation or other experimental conditions. In a biological sense, the presence or absence of stimulation should theoretically significantly affect the genes that are considered differentially expressed. The next step in the differential expression analysis will aim to more strongly account for confounding experimental conditions so that we can elucidate the true DEGs related to enantiomer treatment.
The next step in the differential expression analysis is to try to elucidate the DEGs that are relevant to the enantiomer treatments, while controlling for other experimental conditions (stimulation, time). Initially, we attempted to separate the samples into multiple comparison groups by both stimulation and time. However, it was found that this dramatically reduced the number of samples in comparison groups, with some groups having only one sample. This method signficantly reduced the statistical power, and we were not confident in the list of DEGs that we received. Instead, the analyses are separated by stimulation (TWEAK vs no stimulation) so that we can eliminate the effects of stimulationn conditions when searching for DEGs between the enantiomer treatments. Thinking biologically, we decided it was more important to separate groups based on the presence of stimulation rather than time, since the TWEAK stimulant directly effects the non-canonical signaling pathway of interest. Two different comparison groups are created (dgeenantstim and dgeenantnostim) for subsequent DGE analyses.
Further, since the number of samples in each comparison group is rather small (between 5 and 7), we need to utilize limited replication technique. The following procedure is applied:
# Subsetting data
levels(se.filt.enants$stimulation) <- sub("stimulation: TWEAK", "TWEAK", levels(se.filt.enants$stimulation))
levels(se.filt.enants$stimulation) <- sub("stimulation: UnStim", "UnStim", levels(se.filt.enants$stimulation))
levels(se.filt.enants$treatment) <- sub("treatment: BIO0919278", "BIO0919278", levels(se.filt.enants$treatment))
levels(se.filt.enants$treatment) <- sub("treatment: BIO0702697", "BIO0702697", levels(se.filt.enants$treatment))
se.filt.enants.tweak <- se.filt.enants[, se.filt.enants$stimulation == "TWEAK"]
se.filt.enants.nostim <- se.filt.enants[, se.filt.enants$stimulation == "UnStim"]
se.filt.enants.nostim$treatment <- droplevels(se.filt.enants.nostim$treatment)
se.filt.enants.tweak$treatment <- droplevels(se.filt.enants.tweak$treatment)
se.filt.enants.tweak$treatment
V7 V13 V15 V21 V23 V29 V31
BIO0919278 BIO0702697 BIO0919278 BIO0702697 BIO0919278 BIO0702697 BIO0919278
Levels: BIO0702697 BIO0919278
se.filt.enants.nostim$treatment
V6 V14 V20 V28 V30
BIO0919278 BIO0919278 BIO0702697 BIO0702697 BIO0919278
Levels: BIO0702697 BIO0919278
# Subsetting DGEList
dgeenantstim <- DGEList(counts=assays(se.filt.enants.tweak)$counts, genes=data.frame(Symbol=mcols(se.filt.enants.tweak)$symbol))
dgeenantstim <- calcNormFactors(dgeenantstim)
dgeenantnostim <- DGEList(counts=assays(se.filt.enants.nostim)$counts, genes=data.frame(Symbol=mcols(se.filt.enants.nostim)$symbol))
dgeenantnostim <- calcNormFactors(dgeenantnostim)
#Comparison of Enantiomer Treatments (TWEAK, no covariates)
mod2 <- model.matrix(~ se.filt.enants.tweak$treatment, colData(se.filt.enants.tweak))
dgeenantstim <- estimateDisp(dgeenantstim, mod2)
fit2 <- glmQLFit(dgeenantstim, mod2)
qlf2 <- glmQLFTest(fit2, coef=2)
tt2 <- topTags(qlf2, n=Inf)
mod2
(Intercept) se.filt.enants.tweak$treatmentBIO0919278
SRR7089374 1 1
SRR7089380 1 0
SRR7089382 1 1
SRR7089388 1 0
SRR7089390 1 1
SRR7089396 1 0
SRR7089398 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$`se.filt.enants.tweak$treatment`
[1] "contr.treatment"
#Comparison of Enantiomer Treatments (UnStim, no covariates)
mod3 <- model.matrix(~ se.filt.enants.nostim$treatment, colData(se.filt.enants.nostim))
dgeenantnostim <- estimateDisp(dgeenantnostim, mod3)
fit3 <- glmQLFit(dgeenantnostim, mod3)
qlf3 <- glmQLFTest(fit3, coef=2)
tt3 <- topTags(qlf3, n=Inf)
# Surrogate Variable Testing
mod01 <- model.matrix(~ 1, colData(se.filt.enants.tweak))
sv1 <- sva(assays(se.filt.enants.tweak)$logCPM, mod=mod2, mod0=mod01)
Number of significant surrogate variables is: 1
Iteration (out of 5 ):1 2 3 4 5
sv1$n
[1] 1
mod02 <- model.matrix(~ 1, colData(se.filt.enants.nostim))
sv2 <- sva(assays(se.filt.enants.nostim)$logCPM, mod=mod3, mod0=mod02)
Number of significant surrogate variables is: 1
Iteration (out of 5 ):1 2 3 4 5
sv2$n
[1] 1
#Comparison of Enantiomer Treatments (TWEAK, unknown covariates)
mod5 <- model.matrix(~ se.filt.enants.tweak$treatment, colData(se.filt.enants.tweak))
dgeenantstim <- estimateDisp(dgeenantstim, mod5)
fit5 <- glmQLFit(dgeenantstim, mod5)
qlf5 <- glmQLFTest(fit5, coef=2)
tt5 <- topTags(qlf5, n=Inf)
#Comparison of Enantiomer Treatments (UnStim, unknown covariates)
mod4 <- model.matrix(~ se.filt.enants.nostim$treatment, colData(se.filt.enants.nostim))
dgeenantnostim <- estimateDisp(dgeenantnostim, mod4)
fit4 <- glmQLFit(dgeenantnostim, mod4)
qlf4 <- glmQLFTest(fit4, coef=2)
tt4 <- topTags(qlf4, n=Inf)
We can see that the tweak comparison group includes 7 samples with 4 treated with BIO919278 and 3 treated with BIO0702697. The nostim comparison group includes 5 samples, 3 of which were treated with BIO919278 and 2 treated with BIO0702697. This analysis was conducted both with no covariates and unknown covariates. We attempted to adjust for known covariates (time), but there was an error that the design matrix was not of full rank. This is due to the fact that we are using a limited replication analysis and we simply do not have enough data points. There was only one surrogate variable presented from the surrogate variable testing, which led us to believe that adjusting for unknown covariates would not be as beneficial.
Below we see a preview of the most signicant DEGs found using the limited replication differential expression analysis with respect to TWEAK stimulation. The previews are presented in the same order as the code below. Additionally, we arw saving the DEGs with FDR < 0.01.
# Differential Expression Table (TWEAK)
head(tt2,10)
Coefficient: se.filt.enants.tweak$treatmentBIO0919278
Symbol logFC logCPM F PValue FDR
6752 SSTR2 -3.263151 4.722807 220.0516 4.201147e-07 0.002216083
643836 ZFP62 -2.698245 3.880877 192.3153 7.071488e-07 0.002216083
7741 ZSCAN26 -3.619425 2.603803 187.8349 7.744089e-07 0.002216083
1958 EGR1 4.861429 9.401018 185.2372 8.170829e-07 0.002216083
64783 RBM15 2.275947 6.348114 180.1252 9.100032e-07 0.002216083
9807 IP6K1 -2.208667 5.194190 167.0127 1.216537e-06 0.002216083
54838 WBP1L -2.303788 5.045818 166.2091 1.239248e-06 0.002216083
10363 HMG20A -2.032194 5.217478 154.0232 1.658651e-06 0.002216083
2736 GLI2 -2.255932 5.267824 149.2553 1.870308e-06 0.002216083
7779 SLC30A1 2.615566 8.220260 138.2207 2.506077e-06 0.002216083
sum(tt2$table$FDR < 0.01)
[1] 571
sum(tt2$table$FDR < 0.05)
[1] 3571
tweakDEG <- head(tt2, 571)
#Differential Expression Table (UnStim)
head(tt3, 10)
Coefficient: se.filt.enants.nostim$treatmentBIO0919278
Symbol logFC logCPM F PValue FDR
1958 EGR1 5.596727 9.471384 115.53702 7.867152e-06 0.06248152
1959 EGR2 5.092510 4.427130 94.29537 1.612773e-05 0.06248152
2296 FOXC1 3.025920 5.384855 76.69481 3.313976e-05 0.06248152
6242 RTKN -3.136723 3.422720 75.67917 3.470290e-05 0.06248152
3726 JUNB 5.394030 8.506892 70.92467 4.339895e-05 0.06248152
205428 C3orf58 2.842493 5.790873 68.28324 4.943551e-05 0.06248152
8553 BHLHE40 2.863960 8.232418 64.80958 5.909282e-05 0.06248152
93517 SDR42E1 -2.308680 4.039448 60.04808 7.658122e-05 0.06248152
2736 GLI2 -2.430199 5.426803 59.02270 8.117441e-05 0.06248152
64783 RBM15 2.184114 6.393603 58.57774 8.327602e-05 0.06248152
sum(tt3$table$FDR < 0.05)
[1] 0
sum(tt3$table$FDR < 0.1)
[1] 2190
#Differential Expression Table (TWEAK, unknown cov)
head(tt5,10)
Coefficient: se.filt.enants.tweak$treatmentBIO0919278
Symbol logFC logCPM F PValue FDR
6752 SSTR2 -3.263151 4.722807 220.0516 4.201147e-07 0.002216083
643836 ZFP62 -2.698245 3.880877 192.3153 7.071488e-07 0.002216083
7741 ZSCAN26 -3.619425 2.603803 187.8349 7.744089e-07 0.002216083
1958 EGR1 4.861429 9.401018 185.2372 8.170829e-07 0.002216083
64783 RBM15 2.275947 6.348114 180.1252 9.100032e-07 0.002216083
9807 IP6K1 -2.208667 5.194190 167.0127 1.216537e-06 0.002216083
54838 WBP1L -2.303788 5.045818 166.2091 1.239248e-06 0.002216083
10363 HMG20A -2.032194 5.217478 154.0232 1.658651e-06 0.002216083
2736 GLI2 -2.255932 5.267824 149.2553 1.870308e-06 0.002216083
7779 SLC30A1 2.615566 8.220260 138.2207 2.506077e-06 0.002216083
sum(tt5$table$FDR < 0.01)
[1] 571
sum(tt5$table$FDR < 0.05)
[1] 3571
#Differential Expression Table (UnStim, unknown cov)
head(tt4,10)
Coefficient: se.filt.enants.nostim$treatmentBIO0919278
Symbol logFC logCPM F PValue FDR
1958 EGR1 5.596727 9.471384 115.53702 7.867152e-06 0.06248152
1959 EGR2 5.092510 4.427130 94.29537 1.612773e-05 0.06248152
2296 FOXC1 3.025920 5.384855 76.69481 3.313976e-05 0.06248152
6242 RTKN -3.136723 3.422720 75.67917 3.470290e-05 0.06248152
3726 JUNB 5.394030 8.506892 70.92467 4.339895e-05 0.06248152
205428 C3orf58 2.842493 5.790873 68.28324 4.943551e-05 0.06248152
8553 BHLHE40 2.863960 8.232418 64.80958 5.909282e-05 0.06248152
93517 SDR42E1 -2.308680 4.039448 60.04808 7.658122e-05 0.06248152
2736 GLI2 -2.430199 5.426803 59.02270 8.117441e-05 0.06248152
64783 RBM15 2.184114 6.393603 58.57774 8.327602e-05 0.06248152
sum(tt4$table$FDR < 0.05)
[1] 0
sum(tt4$table$FDR < 0.10)
[1] 2190
We obtain 571 differentially expressed genes (DEGs) at FDR < 1% and 3571 at FDR < 5% with TWEAK stimulation. In Figure 16 below we can see the distribution of the resulting p-values.
We obtain 0 differentially expressed genes (DEGs) at FDR < 5% and 2190 at FDR < 10% with no stimulation. In Figure 17 below we can see the distribution of the resulting p-values.
We receive the exact same results using unknown covariates. Therefore, the rest of the analysis will only include the results from the no covariate group. Additionally, since there are 0 DEGs in the unstimulated condition group with FDR < 5%, we will exclude these results from further analysis. Further, it makes biological sense that there are significantly fewer DEGs in the unstimulated condition since only the stimulated conditions are directly affecting the non-canonical signaling pathway of interest.
Lastly, we have saved the DEGs with FDR < 0.01 to be utilized further in this differential expression analysis pipeline.
Figure 16: Distribution of raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with TWEAK stimulation
Figure 17: Distribution of raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with no stimulation
The p-value distributions for both the stimulated and unstimulated conditions look fairly similar. However, it appears that the p-values are slightly more uniformly distributed in the stimulated condition. This distinction provides further evidence for using DEGs from the stimulated comparison group rather than the unstimulated comparison group.
Below, a new data frame is created that includes the fold change values (Log2FC, FC, 1/FC) as well as statistical indicators (p-values, FDR). This data frame will be utilized to generate a diagnostic volcano plot further in the analysis pipeline. A preview of the table is presented below, and the following procedure was followed:
# Log Fold Analysis and Filtering
ranking2 <- order(abs(tt2[["table"]][["logFC"]]), decreasing=TRUE)
tab <- data.frame(Gene=rowData(se.filt.enants.tweak)$symbol[ranking2],
Log2FC=round(tt2[["table"]][["logFC"]][ranking2], digits=10),
FC=round(2^tt2[["table"]][["logFC"]][ranking2], digits=10),
"1/FC"=round(2^(-tt2[["table"]][["logFC"]][ranking2]), digits=10),
Pvalues=round(tt2[["table"]][["PValue"]][ranking2], digits=10),
FDR=round(tt2[["table"]][["FDR"]][ranking2], digits=10),
row.names=rownames(tt2),
check.names=FALSE)
head(tab)
Gene Log2FC FC 1/FC Pvalues FDR
6752 ZNF256 8.524302 368.1888 0.002715998 0.0002004338 0.007303074
643836 SPC24 7.946608 246.6989 0.004053524 0.0040187656 0.025111115
7741 CDC42EP2 7.849888 230.7022 0.004334592 0.0005264471 0.010963274
1958 SEMA4B 7.770743 218.3869 0.004579029 0.0006131388 0.011587581
64783 FNTA 7.194514 146.4753 0.006827088 0.0137626519 0.050668622
9807 GPR180 6.875140 117.3879 0.008518766 0.0051539447 0.028525682
Figure 18 shows a volcano plot of log fold change values by raw p-values between genes treated with BIO0919278 and genes treated with BIO0702697 in the TWEAK stimulation group.
Figure 18: Volcano plot of log fold change values by raw p-values for a limited replication F-test on every gene between treatment with BIO0919278 vs BIO0702697 with TWEAK stimulation
The volcano plot above includes a horizontal FDR cutoff line of FDR < 0.01. The points above this line represent the 571 significant differentially expressed genes previously discussed above. This plot shows that the majority of these significant DEGs have negative log fold change values. With respect to this set of DEGs, this indicates biologically that treatment with BIO919278 tends produce underexpressed genes relative to treatment with BIO0702697.
We will also build a table with the subset of DEGs with FDR < 1% and show the top-10 genes with lowest p-value in Table 2 below.
| Symbol | logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|---|
| 6752 | SSTR2 | -3.263151 | 4.722807 | 220.0516 | 4.0e-07 | 0.0022161 |
| 643836 | ZFP62 | -2.698245 | 3.880877 | 192.3153 | 7.0e-07 | 0.0022161 |
| 7741 | ZSCAN26 | -3.619425 | 2.603803 | 187.8349 | 8.0e-07 | 0.0022161 |
| 1958 | EGR1 | 4.861429 | 9.401018 | 185.2372 | 8.0e-07 | 0.0022161 |
| 64783 | RBM15 | 2.275947 | 6.348114 | 180.1252 | 9.0e-07 | 0.0022161 |
| 9807 | IP6K1 | -2.208667 | 5.194191 | 167.0127 | 1.2e-06 | 0.0022161 |
| 54838 | WBP1L | -2.303788 | 5.045818 | 166.2091 | 1.2e-06 | 0.0022161 |
| 10363 | HMG20A | -2.032194 | 5.217478 | 154.0232 | 1.7e-06 | 0.0022161 |
| 2736 | GLI2 | -2.255932 | 5.267824 | 149.2553 | 1.9e-06 | 0.0022161 |
| 7779 | SLC30A1 | 2.615566 | 8.220260 | 138.2207 | 2.5e-06 | 0.0022161 |
Below, we are preparing the data for the GO analysis by isolating the rownmames from our DEG list.
DEGrows <- rownames(tweakDEG)
length(DEGrows)
[1] 571
There are 571 genes in the stimulated comparison with FDR < 0.01.
The goal of this analysis is to elucidate the biological functions associated with these genes. We are specifically interested in viewing DEGs related to the molecular processes examined in the original article (noncanonical NF-kB signaling pathway). The gene universe provided in the GO analysis includes the genes in the se.filt.enants data set. The resulting table below provides a preview of the results for the GO analysis. Additionally, we have created an HTML page with the full results of the analysis. The following procedure was followed:
library(org.Hs.eg.db)
allHumanGO <- select(org.Hs.eg.db, columns="GO", key=keys(org.Hs.eg.db, keytype="SYMBOL"), keytype="SYMBOL")
sort(table(allHumanGO$EVIDENCE), decreasing=TRUE)
IEA IDA IBA TAS ISS IMP IPI NAS HDA ND IGI ISA IC
77387 62996 54631 43106 24288 20213 15429 7682 7366 1836 1574 1445 1299
IEP ISM EXP RCA HMP HEP ISO
878 757 293 133 130 70 7
geneUniverse <- rownames(se.filt.enants)
# conditional non-filtered DEG
library(GOstats)
params1 <- new("GOHyperGParams", geneIds=DEGrows,
universeGeneIds=geneUniverse,
annotation="org.Hs.eg.db", ontology="BP",
pvalueCutoff=0.05, testDirection="over")
conditional(params1) <- TRUE
hgOverCond <- hyperGTest(params1)
goresults <- summary(hgOverCond)
goresults <- goresults[goresults$Size >= 3 & goresults$Size <= 500 & goresults$Count >= 3, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
geneIDs <- geneIdsByCategory(hgOverCond)[goresults$GOBPID]
geneSYMs <- sapply(geneIDs, function(id) select(org.Hs.eg.db, columns="SYMBOL", key=id, keytype="ENTREZID")$SYMBOL)
geneSYMs <- sapply(geneSYMs, paste, collapse=", ")
goresults <- cbind(goresults, Genes=geneSYMs)
rownames(goresults) <- 1:nrow(goresults)
library(kableExtra)
use_directory(file.path("doc"))
## generate full table in HTML and store it into the 'doc' directory
ktab <- kable(goresults, "html", caption="GO results for conditional analysis in TWEAK stimulated group (FDR < 0.01).")
ktab <- kable_styling(ktab, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
fnameHTML <- "GOresults.html"
fpathHTML <- proj_path(file.path("doc", fnameHTML))
save_kable(ktab, file=fpathHTML, self_contained=TRUE)
browseURL("GOresults.html")
ktabshow <- kable(goresults[1:10,])
ktabshow <- kable_styling(ktabshow, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
ktabshow
| GOBPID | Pvalue | OddsRatio | ExpCount | Count | Size | Term | Genes |
|---|---|---|---|---|---|---|---|
| GO:0007296 | 0.0003406 | 64.67251 | 0.1782999 | 3 | 4 | vitellogenesis | ETV6, ZMIZ1, SOS1 |
| GO:1903265 | 0.0002367 | 21.59375 | 0.3565999 | 4 | 8 | positive regulation of tumor necrosis factor-mediated signaling pathway | TRIM32, HSPA1A, HSPA1B, RIPK1 |
| GO:0061626 | 0.0026940 | 16.16374 | 0.3120249 | 3 | 7 | pharyngeal arch artery morphogenesis | FOLR1, BMPR1A, NOG |
| GO:0003197 | 0.0040787 | 13.03465 | 0.3538781 | 3 | 8 | endocardial cushion development | RBM24, FOXF1, GATA4 |
| GO:0003161 | 0.0041683 | 12.92982 | 0.3565999 | 3 | 8 | cardiac conduction system development | ID2, BMPR1A, MAML1 |
| GO:0010944 | 0.0041683 | 12.92982 | 0.3565999 | 3 | 8 | negative regulation of transcription by competitive promoter binding | CREB1, SMAD7, BHLHE41 |
| GO:0048617 | 0.0041683 | 12.92982 | 0.3565999 | 3 | 8 | embryonic foregut morphogenesis | FOXF1, GATA4, ACVR2B |
| GO:0070424 | 0.0041683 | 12.92982 | 0.3565999 | 3 | 8 | regulation of nucleotide-binding oligomerization domain containing signaling pathway | BIRC3, HSPA1A, HSPA1B |
| GO:0071550 | 0.0041683 | 12.92982 | 0.3565999 | 3 | 8 | death-inducing signaling complex assembly | TNFRSF1A, CASP8, RIPK1 |
| GO:1904748 | 0.0041683 | 12.92982 | 0.3565999 | 3 | 8 | regulation of apoptotic process involved in development | FOXC1, TNFRSF1A, VDR |
The original article is primarily concerned with the noncanonical NF-kB signaling pathway and ways in which this pathway is molecularly regulated. This pathway plays an important role in a wide range of critical biological functions, including development and homeostatic control of the immune system. The paper identified molecule BIO0919278 as an activator of this pathway via inhibiting CDK12. The primary focus of this analysis was determine gene expression differences between cellular treatment with BIO0919278 and its (S) enantiomer BIO0702697 and uncover biological functions related to those differentially expressed genes.
The GO results from this analysis seem to provide reasonable evidence that verify some of the results claimed in the original paper. Since BIO0702697 was shown by the authors to be significantly less potent compared to BIO0919278, it was intended to serve somewhat as a negative control. An initial scan of the GO results reveal many biological pathways directly associated with various forms of cellular development development (regulation of apoptotic process involved in development, cardiac conduction system development, endocardial cushion development, notochord development, metanephric mesenchyme development, etc.). These results, though general, should be expected since BIO919278 activates noncanonical NF-kB signaling, which is crucial in various development pathways. Additionally, these GO results represent genes that are overexpressed in BIO0919278 relative to BIO702697. Therefore, the presence of developmental pathways in the results provides support for the idea that BIO0919278 is a much more potent activator of the noncanonical NF-kB signaling pathway relative to its (S) enantiomer.
On a more specific level, the authors were particularly interested in RNA polymerase (Pol) II–mediated transcription and the role in which BIO0919278 may play in this pathway. It was hypothesized that BIO0919278 targets and inhibits CDK12, which itself has been previously shown to activate RNA polymerase (Pol) II–mediated transcription. The GO results identify the RNA polymerase (Pol) II–mediated transcription pathway as signficant with respect to BIO0919278 with an odds ratio = 3.08. This provides concrete evidence for the role of BIO0919278 in targeting CDK12 and ultimately regulating this pathway. This pathway is especially important since many previous studies have identified CDK12 as an emerging therapeutic target for cancer due to its role in regulating cellular transcription.
Overall, the functional enrichment results from this analysis serve to verify the results presented in the initial article. There is a growing body of evidence that BIO0919278 ia an inhibitor of CDK12, and as such could potentially be a compound of interest in developing cancer therapeutics. Further research must be conducted in order to more accurately assess cellular changes induced by BIO0919278. Such experiments should include multiple types of analyses (RNASeq transcriptome analysis, chemoproteomics, confocal microscopy, etc.) in order to quantitatively classify the extent of BIO0919278’s involvement in the aforementioned biological pathways.
sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS/LAPACK: /Users/zacharycollester/opt/miniconda3/envs/rbase/lib/libopenblasp-r0.3.9.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GO.db_3.11.4 GOstats_2.54.0
[3] graph_1.66.0 Category_2.54.0
[5] Matrix_1.2-18 org.Hs.eg.db_3.11.4
[7] sva_3.36.0 BiocParallel_1.22.0
[9] genefilter_1.70.0 mgcv_1.8-31
[11] nlme_3.1-148 geneplotter_1.66.0
[13] annotate_1.66.0 XML_3.99-0.3
[15] AnnotationDbi_1.50.0 lattice_0.20-41
[17] edgeR_3.30.0 limma_3.44.1
[19] SummarizedExperiment_1.18.1 DelayedArray_0.14.0
[21] matrixStats_0.56.0 Biobase_2.48.0
[23] GenomicRanges_1.40.0 GenomeInfoDb_1.24.0
[25] IRanges_2.22.1 S4Vectors_0.26.0
[27] BiocGenerics_0.34.0 usethis_1.6.1
[29] kableExtra_1.1.0 knitr_1.28
[31] BiocStyle_2.16.0
loaded via a namespace (and not attached):
[1] bitops_1.0-6 fs_1.4.1 bit64_4.0.2
[4] webshot_0.5.2 RColorBrewer_1.1-2 httr_1.4.1
[7] rprojroot_1.3-2 Rgraphviz_2.32.0 tools_4.0.0
[10] backports_1.1.7 R6_2.4.1 KernSmooth_2.23-17
[13] DBI_1.1.0 colorspace_1.4-1 bit_4.0.4
[16] compiler_4.0.0 cli_2.0.2 rvest_0.3.5
[19] xml2_1.3.2 bookdown_0.20 scales_1.1.1
[22] readr_1.3.1 RBGL_1.64.0 stringr_1.4.0
[25] digest_0.6.25 rmarkdown_2.3 AnnotationForge_1.30.1
[28] XVector_0.28.0 pkgconfig_2.0.3 htmltools_0.4.0
[31] highr_0.8 rlang_0.4.6 rstudioapi_0.11
[34] RSQLite_2.2.0 RCurl_1.98-1.2 magrittr_1.5
[37] GenomeInfoDbData_1.2.3 Rcpp_1.0.4.6 munsell_0.5.0
[40] fansi_0.4.1 lifecycle_0.2.0 stringi_1.4.6
[43] yaml_2.2.1 zlibbioc_1.34.0 grid_4.0.0
[46] blob_1.2.1 crayon_1.3.4 splines_4.0.0
[49] hms_0.5.3 locfit_1.5-9.4 pillar_1.4.4
[52] glue_1.4.1 evaluate_0.14 BiocManager_1.30.10
[55] vctrs_0.3.0 assertthat_0.2.1 xfun_0.14
[58] xtable_1.8-4 survival_3.1-12 viridisLite_0.3.0
[61] tibble_3.0.1 memoise_1.1.0 ellipsis_0.3.1
[64] GSEABase_1.50.0
Henry, Kate L., Debra Kellner, Bekim Bajrami, John E. Anderson, Mercedes Beyna, Govinda Bhisetti, Tom Cameron, et al. 2018. “CDK12-Mediated Transcriptional Regulation of Noncanonical Nf-κB Components Is Essential for Signaling.” Science Signaling 11 (541). https://doi.org/10.1126/scisignal.aam8216.
Maglott, Donna, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. 2010. “Entrez Gene: Gene-Centered Information at Ncbi.” Nucleic Acids Research 39 (suppl_1): D52–D57. https://doi.org/10.1093/nar/gkq1237.
Ziemann, Mark, Antony Kaspi, and Assam El-Osta. 2019. “Digital Expression Explorer 2: A Repository of Uniformly Processed Rna Sequencing Data.” GigaScience 8 (4): giz022. https://doi.org/10.1093/gigascience/giz022.